NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Every Breath You Don’t Take: Deepfake Speech Detection Using Breath

Layton, Seth; DeAndrade, Thiago; Olszewski, Daniel; Warren, Kevin; Butler, Kevin; Traynor, Patrick; Gates, Carrie (September 2025, ACM Digital Threats: Research and Practice (DTRAP))

Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level part of speech, is a key component of natural speech and thus improper generation in deepfake speech is a performant discriminator. To evaluate this, we create a breath detector and leverage this against a custom dataset of online news article audio to discriminate between real/deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison for future work. Applying our simple breath detector as a deepfake speech discriminator on in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec and Codecfake models and show that these complex deep learning model completely either fail to classify the same in-the-wild samples (0.72 AUPRC and 0.89 EER), or substantially lack in the computational and temporal performance compared to our methodology (37 seconds to predict a one minute sample with Codecfake vs. 0.3 seconds with our model)
more » « less
Free, publicly-accessible full text available September 1, 2026
Characterizing the Impact of Audio Deepfakes in the Presence of Cochlear Implant

https://doi.org/10.14722/ndss.2025.241117

Pasternak, Magdalena; Warren, Kevin; Olszewski, Daniel; Nittrouer, Susan; Traynor, Patrick; Butler, Kevin (January 2025, Internet Society)

Cochlear implants (CIs) allow deaf and hard-ofhearing individuals to use audio devices, such as phones or voice assistants. However, the advent of increasingly sophisticated synthetic audio (i.e., deepfakes) potentially threatens these users. Yet, this population’s susceptibility to such attacks is unclear. In this paper, we perform the first study of the impact of audio deepfakes on CI populations. We examine the use of CI-simulated audio within deepfake detectors. Based on these results, we conduct a user study with 35 CI users and 87 hearing persons (HPs) to determine differences in how CI users perceive deepfake audio. We show that CI users can, similarly to HPs, identify text-to-speech generated deepfakes. Yet, they perform substantially worse for voice conversion deepfake generation algorithms, achieving only 67% correct audio classification. We also evaluate how detection models trained on a CI-simulated audio compare to CI users and investigate if they can effectively act as proxies for CI users. This work begins an investigation into the intersection between adversarial audio and CI users to identify and mitigate threats against this marginalized group.
more » « less
Full Text Available
"I Had Sort of a Sense that I Was Always Being Watched...Since I Was": Examining Interpersonal Discomfort From Continuous Location-Sharing Applications

https://doi.org/10.1145/3658644.3690342

Childs, Kevin; Gibson, Cassidy; Crowder, Anna; Warren, Kevin; Stillman, Carson; Redmiles, Elissa M; Jain, Eakta; Traynor, Patrick; Butler, Kevin_R B (December 2024, ACM)

Full Text Available
"Better Be Computer or I'm Dumb": A Large-Scale Evaluation of Humans as Audio Deepfake Detectors

https://doi.org/10.1145/3658644.3670325

Warren, Kevin; Tucker, Tyler; Crowder, Anna; Olszewski, Daniel; Lu, Allison; Fedele, Caroline; Pasternak, Magdalena; Layton, Seth; Butler, Kevin; Gates, Carrie; et al (December 2024, ACM)

Audio deepfakes represent a rising threat to trust in our daily communications. In response to this, the research community has developed a wide array of detection techniques aimed at preventing such attacks from deceiving users. Unfortunately, the creation of these defenses has generally overlooked the most important element of the system - the user themselves. As such, it is not clear whether current mechanisms augment, hinder, or simply contradict human classification of deepfakes. In this paper, we perform the first large-scale user study on deepfake detection. We recruit over 1,200 users and present them with samples from the three most widely-cited deepfake datasets. We then quantitatively compare performance and qualitatively conduct thematic analysis to motivate and understand the reasoning behind user decisions and differences from machine classifications. Our results show that users correctly classify human audio at significantly higher rates than machine learning models, and rely on linguistic features and intuition when performing classification. However, users are also regularly misled by pre-conceptions about the capabilities of generated audio (e.g., that accents and background sounds are indicative of humans). Finally, machine learning models suffer from significantly higher false positive rates, and experience false negatives that humans correctly classify when issues of quality or robotic characteristics are reported. By analyzing user behavior across multiple deepfake datasets, our study demonstrates the need to more tightly compare user and machine learning performance, and to target the latter towards areas where humans are less likely to successfully identify threats.
more » « less
Full Text Available
SoK: The Good, The Bad, and The Unbalanced: Measuring Structural Limitations of Deepfake Media Datasets

Layton, Seth; Tucker, Tyler; Olszewski, Daniel; Warren, Kevin; Butler, Kevin; Traynor, Patrick (August 2024, USENIX Security Symposium)

Full Text Available
SoK: The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems

https://doi.org/10.1109/SP40001.2021.00014

Abdullah, Hadi; Warren, Kevin; Bindschaedler, Vincent; Papernot, Nicholas; Traynor, Patrick (April 2021, Proceedings of the IEEE Symposium on Security and Privacy)
null (Ed.)
Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy in neural networks. Like other systems based on neural networks, recent research has demonstrated that speech and speaker recognition systems are vulnerable to attacks using manipulated inputs. However, as we demonstrate in this paper, the end-to-end architecture of speech and speaker systems and the nature of their inputs make attacks and defenses against them substantially different than those in the image space. We demonstrate this first by systematizing existing research in this space and providing a taxonomy through which the community can evaluate future work. We then demonstrate experimentally that attacks against these models almost universally fail to transfer. In so doing, we argue that substantial additional work is required to provide adequate mitigations in this space.
more » « less
Full Text Available
Hear "No Evil", See "Kenansville": Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems

https://doi.org/10.1109/SP40001.2021.00009

Abdullah, Hadi; Rahman, M.; Garcia, Washington; Warren, Kevin; Yadav, Anurag; Shrimpton, Thomas; Traynor, Patrick (April 2021, Proceedings of the IEEE Symposium on Security and Privacy)
null (Ed.)
Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension. Processing pipelines for modern systems are comprised of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box, transferable, can be tuned to require zero queries to the target, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy.
more » « less
Full Text Available

Search for: All records